The dataset comprises 23 variables and is a time series of Australian weather, with the purpose of predicting whether it will rain tomorrow.
- Date: Categorical variable, when the measurements were taken
- Location: Categorical variable, where the measurements were taken
- MinTemp: Numerical variable, minimal temperature observed that day
- MaxTemp: Numerical variable, maximal temperature observed that day
- Rainfall
- Evaporation
- Sunshine
- WindGustDir
- WindGustSpeed
- WindDir9am
- WindDir3pm
- WindSpeed9am
- WindSpeed3pm
- Humidity9am
- Humidity3pm
- Pressure9am
- Pressure3pm
- Cloud9am
- Cloud3pm
- Temp9am: Numerical variable, temperature in Celsius at 9am
- Temp3pm: Numerical variable, temperature in Celsius at 3pm
- RainToday: Categorical variable, whether it rained today or not
- RainTomorrow: Categorical variable, whether it will rain tomorrow or not
We start with a basic visualization of the correlation between variables. The correlations are not significant, except between variables recorded on the same day, that is, the measurements taken at 9am and 3pm. This shows that there is an important temporal component within each day.

We use a special method of data imputation. From the time series plot of WindDir9am, WindDir3pm and WindGustDir, we can see that if the previous day fell in a certain category, the current day will probably fall in that same category. We therefore impute categorical NAs with the value of the previous day.
## INFO [2021-06-02 16:14:38] Date has been selected as the timestamp column
## INFO [2021-06-02 16:14:38] has been selected as the numeric column(s)
## INFO [2021-06-02 16:14:38] WindDir9am, WindDir3pm, WindGustDir has been selected as the state column(s)
## INFO [2021-06-02 16:14:38] creating state plot layers
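The previous-day imputation described above can be sketched in base R as a "last observation carried forward" fill. This is a minimal illustration, not the code used for the report: the helper name `locf`, the toy data frame and the choice to fill separately within each Location are assumptions.

```r
# Last observation carried forward for one vector; leading NAs stay NA
locf <- function(x) {
  observed <- !is.na(x)
  idx <- cumsum(observed)   # position of the most recent observed value
  idx[idx == 0] <- NA       # nothing observed yet: keep the NA
  x[observed][idx]
}

# Toy data (hypothetical values), already sorted by date within Location
weather <- data.frame(
  Location   = c("Albury", "Albury", "Albury", "Cobar", "Cobar", "Cobar"),
  WindDir9am = c("N", NA, NA, NA, "SW", NA),
  stringsAsFactors = FALSE
)

# Assumption: fill within each Location so values never cross stations
weather$WindDir9am <- ave(weather$WindDir9am, weather$Location, FUN = locf)
print(weather)
```

A value that is missing on the first recorded day of a station has no previous day to copy from, so it stays NA in this sketch and would need separate handling.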
We remove the columns with over 30% NAs: imputation may be too imprecise when about a third of the data is missing, and dropping the affected rows instead would discard too much data. We also remove all rows with NAs in RainToday and RainTomorrow, which amount to 2% of the data. RainTomorrow is the variable to predict, so any imputation would alter the real target, and RainToday is highly related to RainTomorrow, so imputing it might worsen our prediction. To reduce the effect of the temporality of the data, we transform Date into the new variable Season, which approximates the season to which the date belongs.
## Location MinTemp MaxTemp Rainfall
## Length:140787 Min. :-8.50000 Min. :-4.80000 Min. : 0.000000
## Class :character 1st Qu.: 7.60000 1st Qu.:17.90000 1st Qu.: 0.000000
## Mode :character Median :12.00000 Median :22.60000 Median : 0.000000
## Mean :12.18482 Mean :23.23512 Mean : 2.349974
## 3rd Qu.:16.80000 3rd Qu.:28.30000 3rd Qu.: 0.800000
## Max. :33.90000 Max. :48.10000 Max. :371.000000
## NA's :468 NA's :307
## Evaporation Sunshine WindGustDir WindGustSpeed
## Min. : 0.00000 Min. : 0.00000 Length:140787 Min. : 6.00000
## 1st Qu.: 2.60000 1st Qu.: 4.90000 Class :character 1st Qu.: 31.00000
## Median : 4.80000 Median : 8.50000 Mode :character Median : 39.00000
## Mean : 5.47252 Mean : 7.63054 Mean : 39.97052
## 3rd Qu.: 7.40000 3rd Qu.:10.70000 3rd Qu.: 48.00000
## Max. :145.00000 Max. :14.50000 Max. :135.00000
## NA's :59694 NA's :66805 NA's :9105
## WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm
## Length:140787 Length:140787 Min. : 0.0000 Min. : 0.00000
## Class :character Class :character 1st Qu.: 7.0000 1st Qu.:13.00000
## Mode :character Mode :character Median : 13.0000 Median :19.00000
## Mean : 13.9905 Mean :18.63114
## 3rd Qu.: 19.0000 3rd Qu.:24.00000
## Max. :130.0000 Max. :87.00000
## NA's :1055 NA's :2531
## Humidity9am Humidity3pm Pressure9am Pressure3pm
## Min. : 0.00000 Min. : 0.00000 Min. : 980.500 Min. : 977.100
## 1st Qu.: 57.00000 1st Qu.: 37.00000 1st Qu.:1013.000 1st Qu.:1010.400
## Median : 70.00000 Median : 52.00000 Median :1017.600 Median :1015.200
## Mean : 68.82683 Mean : 51.44929 Mean :1017.655 Mean :1015.258
## 3rd Qu.: 83.00000 3rd Qu.: 66.00000 3rd Qu.:1022.400 3rd Qu.:1020.000
## Max. :100.00000 Max. :100.00000 Max. :1041.000 Max. :1039.600
## NA's :1517 NA's :3501 NA's :13743 NA's :13769
## Cloud9am Cloud3pm Temp9am Temp3pm
## Min. :0.00000 Min. :0.00000 Min. :-7.20000 Min. :-5.40000
## 1st Qu.:1.00000 1st Qu.:2.00000 1st Qu.:12.30000 1st Qu.:16.60000
## Median :5.00000 Median :5.00000 Median :16.70000 Median :21.10000
## Mean :4.43116 Mean :4.49925 Mean :16.98707 Mean :21.69318
## 3rd Qu.:7.00000 3rd Qu.:7.00000 3rd Qu.:21.60000 3rd Qu.:26.40000
## Max. :9.00000 Max. :9.00000 Max. :40.20000 Max. :46.70000
## NA's :52625 NA's :56094 NA's :656 NA's :2624
## RainToday RainTomorrow Season
## Length:140787 Length:140787 winter:33981
## Class :character Class :character spring:37027
## Mode :character Mode :character summer:35526
## fall :34253
##
##
##
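The Date to Season transformation mentioned before the summary can be sketched as follows. The function name `date_to_season` and the month boundaries are assumptions: the sketch uses Southern Hemisphere meteorological seasons (December to February as summer), which may differ from the approximation actually used in the report. The factor levels follow the order shown in the summary.

```r
# Map a date to an approximate season.
# Assumption: Southern Hemisphere meteorological seasons by calendar month.
date_to_season <- function(date) {
  month <- as.integer(format(as.Date(date), "%m"))
  by_month <- c("summer", "summer", "fall", "fall", "fall", "winter",
                "winter", "winter", "spring", "spring", "spring", "summer")
  factor(by_month[month], levels = c("winter", "spring", "summer", "fall"))
}

season <- date_to_season(c("2010-01-15", "2010-04-01", "2010-07-20", "2010-10-05"))
print(season)
```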
We then impute the missing continuous data. To avoid data leakage from train into test, we first separate the data into train and test sets, build the MICE predictive mean matching imputation model on the train data only, and apply it to both train and test.
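The leak-free pattern (fit on train, apply to both sets) can be illustrated with a deliberately simpler imputer. Note that this sketch swaps MICE for plain train-median imputation so that it runs in base R; the function names and toy values are hypothetical.

```r
# Learn the imputation parameters from the training data only
fit_median_imputer <- function(train) {
  vapply(train, median, numeric(1), na.rm = TRUE)
}

# Apply previously learned medians to any data set with the same columns
apply_median_imputer <- function(data, medians) {
  for (col in names(medians)) {
    missing <- is.na(data[[col]])
    data[[col]][missing] <- medians[[col]]
  }
  data
}

train <- data.frame(MinTemp = c(10, NA, 14), Humidity3pm = c(40, 60, NA))
test  <- data.frame(MinTemp = c(NA, 30),     Humidity3pm = c(NA, 90))

medians   <- fit_median_imputer(train)   # the test set is never seen here
train_imp <- apply_median_imputer(train, medians)
test_imp  <- apply_median_imputer(test, medians)
print(test_imp)
```

The key point is that the test rows contribute nothing to the learned parameters; the same separation applies when the imputer is a MICE model instead of a median.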
We plot the density distributions of the data. We can observe a Gaussian distribution in MinTemp, MaxTemp, Humidity3pm, Temp9am and Temp3pm. A mixture of Gaussians can be observed in Humidity9am and, if we consider each peak in WindSpeed9am and WindSpeed3pm to be a Gaussian, an extreme version of a mixture of Gaussians is present in these variables. All the categorical variables have roughly balanced distributions, with the exception of RainToday and RainTomorrow, which show the only major imbalance.
Rainfall does not conform to a Gaussian distribution, and a transformation must be applied specifically for it.
A logarithmic transformation is applied to the Rainfall variable, adding a constant value of 1 to deal with zeroes. Rainfall is the variable furthest from a Gaussian distribution, and the transformation brings it to a shape closer to a Gaussian.
We scale the data to a mean of 0 and a variance of 1, so that it is compatible with methods that are sensitive to distance metrics.
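The log transformation of Rainfall and the standardization can be sketched in base R as below. The toy values are hypothetical, and computing the centering and scaling parameters on the train set only (then reusing them on test) is an assumption made here to stay consistent with the leakage concern raised for imputation.

```r
train <- data.frame(Rainfall = c(0, 0, 0.8, 12, 371),
                    MaxTemp  = c(17.9, 22.6, 28.3, 23.2, 48.1))
test  <- data.frame(Rainfall = c(0, 5), MaxTemp = c(20, 30))

# log(1 + x): zero rainfall stays at zero, large values are compressed
train$Rainfall <- log1p(train$Rainfall)
test$Rainfall  <- log1p(test$Rainfall)

# Standardize with parameters estimated on the training data
centers <- colMeans(train)
scales  <- apply(train, 2, sd)
train_scaled <- scale(train, center = centers, scale = scales)
test_scaled  <- scale(test,  center = centers, scale = scales)

print(round(train_scaled, 3))
```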
Our new data retains its original shape with the exception of Rainfall, which, even after the transformation, is still far from a Gaussian distribution, although closer to it than before.
While there appear to be some outliers, almost all of the points flagged as outliers in the boxplots lie extremely close together, which suggests highly skewed distributions rather than true outliers.
Train and test sets are separated for further use in the classification section.
To make the analysis feasible on my computer, the dataset has been sampled.
First, we apply LDA using only the numerical variables (that is, excluding location, wind direction and season).
## Call:
## lda(RainTomorrow ~ ., data = scaled[, -c(1, 5, 7, 8, 17, 19)])
##
## Prior probabilities of groups:
## No Yes
## 0.779 0.221
##
## Group means:
## MinTemp MaxTemp Rainfall WindGustSpeed WindSpeed9am
## No -0.03129009203 0.1381194356 -0.1711179629 -0.1037444468 -0.04033780379
## Yes 0.19883140776 -0.2748446964 0.5284213679 0.4497335562 0.16545824105
## WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm
## No -0.02108994269 -0.2116395863 -0.3277774578 0.1669145290 0.1402097850
## Yes 0.22904496911 0.3911135215 0.7715055231 -0.5273003287 -0.4889403722
## Temp9am Temp3pm
## No 0.05393326701 0.1681473091
## Yes -0.01444686437 -0.3154724759
##
## Coefficients of linear discriminants:
## LD1
## MinTemp 0.2497060858
## MaxTemp 0.5175672616
## Rainfall 0.2608917948
## WindGustSpeed 0.5934534923
## WindSpeed9am -0.1192468627
## WindSpeed3pm -0.2262787930
## Humidity9am -0.2078587237
## Humidity3pm 1.0095116531
## Pressure9am 0.8742397009
## Pressure3pm -1.2728798720
## Temp9am -0.4000923765
## Temp3pm -0.4751598989
Prior probabilities of groups gives the prior probability of each response class for an observation. Here, 77.9% of the observations are followed by no rain tomorrow and 22.1% by rain tomorrow.
Group means gives the mean value (µk) of each predictor within each response class. It indicates the typical values of the different features when an observation falls into a particular response class.
We see a clear difference across almost all the variables: their means have opposite signs in the two classes of RainTomorrow. The gap is especially large for Humidity3pm, Humidity9am, Rainfall and Pressure9am. The larger the difference between the means, the easier it is to classify an observation. We can assume that humidity, rainfall and pressure have more impact on the probability of rain on the following day, while temperature at 9am and minimum temperature have less impact.
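The two quantities discussed above can be reproduced without fitting a model, which may help when reading the LDA output. The sketch below uses simulated data (the values and the 78/22 class split are made up for illustration); only the column names echo the report.

```r
set.seed(42)
n <- 200
rain <- factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.78, 0.22)),
               levels = c("No", "Yes"))

# Simulated, already-standardized predictors: Humidity3pm depends on the
# class, Temp9am does not
toy <- data.frame(
  RainTomorrow = rain,
  Humidity3pm  = rnorm(n, mean = ifelse(rain == "Yes", 1.0, -0.3)),
  Temp9am      = rnorm(n)
)

# Prior probabilities: the class proportions in the data
priors <- prop.table(table(toy$RainTomorrow))

# Group means: one row per class, one column per predictor
group_means <- sapply(toy[-1], function(v) tapply(v, toy$RainTomorrow, mean))

# Gap between class means, a rough indicator of how well a variable separates
gap <- abs(group_means["Yes", ] - group_means["No", ])
print(priors)
print(round(group_means, 3))
print(round(gap, 3))
```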
## [1] 0.854
## RainTomorrow
## Predicted No Yes
## No 733 100
## Yes 46 121
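The accuracy printed above follows directly from the confusion matrix. The sketch below re-enters the four counts from the output and also derives the per-class rates, which matter here because the classes are imbalanced; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above (rows: predicted, columns: actual)
confusion <- matrix(c(733, 46, 100, 121), nrow = 2,
                    dimnames = list(Predicted = c("No", "Yes"),
                                    RainTomorrow = c("No", "Yes")))

accuracy    <- sum(diag(confusion)) / sum(confusion)
sensitivity <- confusion["Yes", "Yes"] / sum(confusion[, "Yes"])  # rainy days caught
specificity <- confusion["No", "No"] / sum(confusion[, "No"])     # dry days caught

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

The accuracy is 0.854, but only about 55% of the rainy days (121 of 221) are detected, against about 94% of the dry days (733 of 779). For comparison, always predicting "No" would already reach 0.779 on these counts.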
The plot below shows how the response classes have been separated by the LDA classifier. The x-axis shows the value of the linear discriminant (LD1) defined by the coefficients of the LDA model. The two groups correspond to the two response classes.
The figure below shows how the data have been classified. The predicted groups No and Yes are colored by the actual class, in red and blue. A mix of colors within a group indicates incorrect predictions.
Apply PCA only on the numerical variables.
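A minimal sketch of a PCA restricted to the numerical columns, using base R's `prcomp` on simulated data. The report does not say which PCA function was used, so both the function choice and the toy data frame are assumptions.

```r
set.seed(1)
toy <- data.frame(
  Temp9am     = rnorm(100),
  Humidity9am = rnorm(100),
  WindGustDir = sample(c("N", "S", "E", "W"), 100, replace = TRUE)
)
toy$Temp3pm <- toy$Temp9am + rnorm(100, sd = 0.3)  # correlated with Temp9am

# Keep only the numeric columns, then centre and scale before the PCA
numeric_cols <- vapply(toy, is.numeric, logical(1))
pca <- prcomp(toy[, numeric_cols], center = TRUE, scale. = TRUE)

# Share of the total variance carried by each component
explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(explained, 3))
print(round(pca$rotation, 3))
```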
Divide the variables into 8 groups.
Use only the categorical variables to apply MCA, with RainToday and RainTomorrow as supplementary variables.
#Clustering
In the following chunk of code, a small amount of preprocessing is applied to the dataset to prepare it for a few clustering algorithms. The clustering algorithms below require numeric input variables, so non-numeric data are discarded. The analysis considers only the climatic descriptors.
Our first approach to clustering was the traditional partitioning method, applying the K-Means algorithm, since it is the least computationally expensive technique. The algorithm was run looking for 2, 3, 4 and 5 clusters (centers = x) to get an idea of the likely shapes of the clusters. The data clearly form a single cloud, so it is not possible to distinguish clean clusters.
To determine the optimal number of clusters we adopted the silhouette method (method = "silhouette"). The output suggests that the optimal number of clusters is two.
Our analysis will then be based on this plot.
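To make the silhouette criterion concrete, the sketch below computes the average silhouette width by hand for k = 2 to 5 and picks the best k. It is an illustration in base R on simulated data, not the call used in the report; the helper `avg_silhouette` is hypothetical.

```r
# Average silhouette width: for each point, a = mean distance to its own
# cluster, b = mean distance to the nearest other cluster, width = (b - a) / max(a, b)
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  labels <- unique(cluster)
  widths <- vapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE
    if (!any(own)) return(0)  # singleton cluster: width defined as 0
    a <- mean(d[i, own])
    b <- min(vapply(labels[labels != cluster[i]],
                    function(k) mean(d[i, cluster == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

# Simulated data: two well-separated groups of 50 points each
set.seed(123)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))

ks <- 2:5
scores <- vapply(ks, function(k) {
  avg_silhouette(x, kmeans(x, centers = k, nstart = 10)$cluster)
}, numeric(1))
best_k <- ks[which.max(scores)]
print(data.frame(k = ks, avg_silhouette = round(scores, 3)))
```

On data that form a single cloud, as in the report, the scores are typically much lower and closer together than in this toy example, so the suggested k should be read with caution.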
As the silhouette method suggested, we study the clustering with k equal to 2. To interpret the results, the centers (k2$centers) help to associate each cluster with particular features. In the output below, the first cluster (1) is representative of the low-temperature observations, while the second cluster (2) is representative of the high temperatures. The high-temperature and low-temperature clusters also differ in terms of humidity and pressure, presenting low and high values respectively.
## MinTemp MaxTemp Rainfall WindGustSpeed WindSpeed9am
## 1 -0.6508956243 -0.7300318512 0.1316102290 -0.2496811333 -0.2159006832
## 2 0.6948146271 0.8066280912 -0.1583370081 0.2420988455 0.2230822078
## WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm
## 1 -0.2292815024 0.4375383565 0.3162830386 0.4092452476 0.4748082130
## 2 0.3256221658 -0.5594113195 -0.4315550132 -0.4924192670 -0.5688877065
## Temp9am Temp3pm
## 1 -0.7164775017 -0.700733183
## 2 0.8080650099 0.805604279
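When reading a table of centers like the one above, it helps to remember that k-means cluster numbers are arbitrary and can swap between runs. A small sketch on simulated data (the values are made up; `k2` mirrors the object name used in the text) that identifies the warm cluster from the centers instead of relying on its number:

```r
set.seed(7)
# Simulated standardized data: a cool, humid group and a warm, dry group
cool <- cbind(MaxTemp = rnorm(50, -1, 0.3), Humidity3pm = rnorm(50,  1, 0.3))
warm <- cbind(MaxTemp = rnorm(50,  1, 0.3), Humidity3pm = rnorm(50, -1, 0.3))
x <- rbind(cool, warm)

k2 <- kmeans(x, centers = 2, nstart = 10)
print(round(k2$centers, 3))

# Pick the warm cluster by its centre, whatever label it received
warm_cluster <- which.max(k2$centers[, "MaxTemp"])
cool_cluster <- which.min(k2$centers[, "MaxTemp"])
```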
A further attempt to identify other cluster shapes used a mixed approach, with hierarchical clustering determining the shape of the clusters. The number of clusters is specified by the parameter k=4, so this time we observe the characteristics of four different clusters.
With a higher number of clusters it is easier to see more variation in how the clusters specialize. The most important cluster in this analysis is clearly number 2 in the output below, since it has a high value of the Rainfall attribute and therefore represents the rainy days. These are very important for our analysis, since the goal of the subsequent prediction phase is to correctly classify the variable RainTomorrow. According to this cluster, rainy days are characterized by high wind values and low pressure and temperatures.
## MinTemp MaxTemp Rainfall WindGustSpeed WindSpeed9am
## 1 1.1670099003 1.3188507646 -0.24550873056 0.4978639921 0.3890422658
## 2 -0.1005353883 -0.6058158347 0.77189658866 0.8055273653 0.7137216930
## 3 -0.9957310840 -0.9748379351 -0.06980687292 -0.6216590987 -0.5333489238
## 4 0.1327010344 0.4313491410 -0.38231612681 -0.4435639289 -0.3517579894
## WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm
## 1 0.6086365708 -0.8740941188 -0.6658488159 -0.7912123743 -0.91693149091
## 2 0.7053517398 0.3626443953 0.4484736763 -0.5864035319 -0.44542230317
## 3 -0.5290437165 0.6216849951 0.4065291537 0.9320913725 0.99134069540
## 4 -0.3519175600 -0.3507035691 -0.3752798884 0.1012723645 0.02021766938
## Temp9am Temp3pm
## 1 1.3385931840 1.3140860198
## 2 -0.3530748557 -0.5950152417
## 3 -1.0386737571 -0.9436046064
## 4 0.3091782299 0.4475020009
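One plausible reading of the mixed approach is sketched below: a hierarchical clustering is cut into k = 4 groups, and the group means then seed a k-means run. The report does not show which function was used, so the Ward linkage, the seeding step and the simulated data are all assumptions.

```r
set.seed(3)
# Simulated data: four compact groups of 30 points each
blob <- function(cx, cy) cbind(rnorm(30, cx, 0.4), rnorm(30, cy, 0.4))
x <- rbind(blob(-2, -2), blob(-2, 2), blob(2, -2), blob(2, 2))
colnames(x) <- c("MaxTemp", "Pressure9am")

# Hierarchical step: build the tree and cut it into four groups
tree   <- hclust(dist(x), method = "ward.D2")
groups <- cutree(tree, k = 4)

# Group means (4 x 2 matrix) used as starting centres for k-means
init <- apply(x, 2, function(col) tapply(col, groups, mean))
fit  <- kmeans(x, centers = init)

print(round(fit$centers, 3))
print(table(fit$cluster))
```

Seeding k-means from the tree removes the dependence on random starting centres, at the cost of computing the full distance matrix, which is one reason to work on a sample of the data.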
Since most of the variables in the dataset show a Gaussian distribution, a Gaussian finite mixture model fitted by the EM algorithm should achieve good clustering results. The Gaussian mixture produced three clusters of shape VEV.
## [1] 3
## [1] "VEV"
Finally, let’s interpret the output of the clustering. Once again there is a cluster that is strongly associated with the Rainfall variable (cluster 3 in the output below), with an even higher value than before. The features of rainy days are almost the same as before, with the difference that this time the humidity is considerably higher and the pressure is not as low.
## [,1] [,2] [,3]
## MinTemp 0.04239902300 -0.006341010582 -0.0577299312253
## MaxTemp 0.31796653010 -0.332804452704 -0.5337275830811
## Rainfall -0.53655048982 0.856034922057 0.8997121262176
## WindGustSpeed -0.15688786499 0.844645239513 -0.0142234681895
## WindSpeed9am -0.08023932089 0.471916439262 -0.0090454702870
## WindSpeed3pm -0.05681496572 0.579489078813 0.0563116141571
## Humidity9am -0.37552526202 0.355070418642 0.5603651060738
## Humidity3pm -0.39570310708 0.510288700806 0.5471268091921
## Pressure9am 0.05242019853 -0.598805341416 -0.0006673402984
## Pressure3pm -0.02721738097 -0.575499088274 0.1678063887525
## Temp9am 0.19675844051 -0.127288745935 -0.2965300869339
## Temp3pm 0.33483280468 -0.446471376486 -0.4694766935272
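For readers unfamiliar with how a Gaussian mixture is fitted by EM, the sketch below implements the algorithm for the simplest case: one variable and two components, in base R. It is a one-dimensional simplification for illustration only, not the multivariate VEV model reported above; the function `em_gmm_1d` and the simulated data are hypothetical.

```r
em_gmm_1d <- function(x, k = 2, iters = 200, tol = 1e-8) {
  n <- length(x)
  # Initial guess: means at spread-out quantiles, common sd, equal weights
  mu    <- as.numeric(quantile(x, probs = (seq_len(k) - 0.5) / k))
  sigma <- rep(sd(x), k)
  w     <- rep(1 / k, k)
  loglik_old <- -Inf
  for (it in seq_len(iters)) {
    # E-step: weighted densities (n x k) and responsibilities
    dens  <- sapply(seq_len(k), function(j) w[j] * dnorm(x, mu[j], sigma[j]))
    total <- rowSums(dens)
    resp  <- dens / total
    # Log-likelihood under the parameters used in this E-step
    loglik <- sum(log(total))
    # M-step: update weights, means and standard deviations
    nk    <- colSums(resp)
    w     <- nk / n
    mu    <- colSums(resp * x) / nk
    sigma <- sqrt(colSums(resp * (x - rep(mu, each = n))^2) / nk)
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(weights = w, means = mu, sds = sigma,
       cluster = max.col(resp, ties.method = "first"))
}

# Simulated data: two Gaussian groups centred at -3 and 3
set.seed(1)
x <- c(rnorm(200, mean = -3), rnorm(200, mean = 3))
fit <- em_gmm_1d(x, k = 2)
print(lapply(fit[c("weights", "means", "sds")], round, 3))
```

Unlike k-means, each point receives a probability of belonging to every component, and the hard cluster label is simply the most probable one.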